This example shows how to solve a common data prep problem - how to convert a wide table to a tidy narrow table. The wide data form is common in spread sheets - especially those representing time series.

The Bureau of Labor Statistics (the BLS) data page, has many data sets and many ways to access them. For a research project we needed the monthly employment figures by "metro areas." Using the Multi-screen Data Search tool for Employment, Hours, and Earnings - State and Metro Area we captured the non-farm employment from 1995 through 2015 as a tab seperated file. See the BLS sm file spec

There is a bit clean-up to do to make it readable with read_tsv() from Hadley's readr package. See the function BLS_CleanRawTextFile() included in this package if you are interested in the details (which are not relevant for this dplyr example).

Read the BLS file into the data frame employment and take a glimpse of the first and last ten columns

library(dplyr)
library(tidyr)
library(readr)
library(dplyrExamples)
library(stringr)
library(ggplot2)
library(lubridate)
fn <- system.file("extdata", "BLS_NonFarmEmploymentInAreas_1995_2015.tsv",
                  package = "dplyrExamples")
file_out <- "file_out.tsv"
BLS_CleanRawTextFile(fn, file_out)  ## Clean up the BLS raw text file
employment <- read_tsv(file_out)
dim(employment)
glimpse(employment[1:10])
glimpse(employment[244:253])

There are two challenges with this data set:

  1. The Series_ID needs to be decoded to pull out the state and area for the row.
  2. It is very wide - 253 columns! We wish to tidy it up, in the Hadley sense, so each row will just a single numeric column, the number of non-farm employees in the area for the month.

The series_id decoder is in Section 5 of the BLS SM file spec and is repeated here:

0        1         2    <--
12345678901234567890    <-- character counter 
SMU01266207072200001    <-- Sample Series ID

Positions Code                          Value

1-2       survey abbreviation   =       SM
3         seasonal (code)       =       U
4-5       state_code            =       01
6-10      area_code             =       26620
11-12     supersector_code      =       70
13-18     industry_code         =       70722000
19-20     data_type_code        =       01

We did not need NIAC's code breakdown so positions 11-18 are all zeros in our extract.

The BLS has standard lookup files for the state and area codes which we load now.

Read State & Area Code Tables and take a glimpse

fac <- system.file("extdata", "BLS_AreaCodes.tsv", package = "dplyrExamples")
area_codes <- read_tsv(fac)
fsc <- system.file("extdata", "BLS_StateCodes.tsv", package = "dplyrExamples")
state_codes <- read_tsv(fsc)
glimpse(area_codes)
glimpse(state_codes)

Now we have everything we need to tidy up our data.

Tidy Up Area Employment by Month

This is the dplyr sequence you would use in production. The next section breaks down the process step-by-step.

Employment_By_Area_1995_2015 <- employment %>%
  gather(mmm_yyyy, NonFarm_000, -Series_ID) %>%
  mutate(Month_Of = as.Date(paste0("01_", mmm_yyyy), format = "%d_%b_%Y"),
         state_code = str_sub(Series_ID, 4, 5),
         area_code = str_sub(Series_ID, 6, 10)) %>%
  left_join(state_codes) %>%
  left_join(area_codes) %>%
  mutate_each(funs(factor), ends_with("name")) %>% 
  rename(State = state_name, Area = area_name) %>%
  select(State, Area, Month_Of, NonFarm_000) %>%
  arrange(State, Area, Month_Of)
glimpse(Employment_By_Area_1995_2015)

Step-by-step dplyr

Convert from wide to narrow using tidyr::gather().
eba <- employment %>% 
  gather(mmm_yyyy, NonFarm_000, -Series_ID)
glimpse(eba)
Convert character mmm_yyyy to Date and pull out state & area codes from Series_ID.
eba <- eba %>% 
mutate(Month_Of = as.Date(paste0("01_", mmm_yyyy), format = "%d_%b_%Y"),
       state_code = str_sub(Series_ID, 4, 5),
       area_code = str_sub(Series_ID, 6, 10))
glimpse(eba)  
Look-up the state and area names from the codes data frames.
eba <- eba %>% 
  left_join(state_codes) %>%
  left_join(area_codes)
glimpse(eba)
Convert the names to factors and rename them to user friendly names.
eba <- eba %>% 
  mutate_each(funs(factor), ends_with("name")) %>% 
  rename(State = state_name, Area = area_name)
glimpse(eba)
Keep just final columns and sort by month within area within state.
eba <- eba %>% 
  select(State, Area, Month_Of, NonFarm_000) %>%
  arrange(State, Area, Month_Of)
glimpse(eba)

EDA is simple now that we have tidy data!

Here are a cople of examples...

  1. Overall summary.
  2. Plot New Mexico monthly employment by year from 2004 through 2015.
summary(Employment_By_Area_1995_2015)

# pull out New Mexico data
AreasInNM<- Employment_By_Area_1995_2015 %>% 
  filter(State == "New Mexico",
         Month_Of >= as.Date("2004-01-01")) %>% 
  mutate(Year = factor(year(Month_Of)),
         Month = month(Month_Of)) %>% 
  droplevels()

# get month labels for plot
months <- month(seq(as.Date("2000/1/1"), by = "month", length.out = 12), 
                label = TRUE, abbr = TRUE)

ggplot(AreasInNM, aes(Month, NonFarm_000, color = Year)) + 
  geom_point(size = I(1)) + geom_line() + 
  scale_x_continuous(breaks = 1:12, labels = months) + 
  ggtitle("Employment (000) by Year for Areas in New Mexico") + 
  facet_grid(Area ~ ., scales = "free_y" )

Learning More

The place to start, of course, is Hadley's vignettes in the dplyr and tidy packages. Especially Introduction to dplyr and Tidy Data.

Now that Hadley is with RStudio, search their blog for dplyr and tidyr; get the Data Wrangling Cheat Sheet; watch Data Wrangling with R & RStudio. To understand Hadley's current thinking about data analysis watch Pipelines for Data Analysis in R and The Grammar and Graphics of Data Science - the latter with Winston Chang.

Lastly, see Garrett & Hadley's chapter on data transform in their upcoming R for Data Science

Conclusion

We hope you have found this example of using dplyr and tidyr useful. Please send comments and suggestions to Jim at DS4CI.org or leave an issue or pull request at my github.

Thanks! Jim

P.S. Don't forget to clean house.
file.remove(file_out)

END



ds4ci/dplyrExamples documentation built on May 15, 2019, 2:56 p.m.